First I will take a look at dimensions, column names, structure and summary of the dataset.
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
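The output above is the kind of thing base R's inspection functions print. A minimal sketch of those calls follows; the file name `wineQualityWhites.csv` is an assumption (the source does not say where the data was loaded from), and a two-row stand-in is used if the file is not found so the snippet still runs.

```r
# Hypothetical file name; adjust to wherever the white wine CSV actually lives.
wine_path <- "wineQualityWhites.csv"

if (file.exists(wine_path)) {
  wine <- read.csv(wine_path)
} else {
  # Two-row stand-in (values copied from the str() output above).
  wine <- data.frame(X = 1:2, alcohol = c(8.8, 9.5), quality = c(6L, 6L))
}

dim(wine)      # number of rows and columns
names(wine)    # column names
str(wine)      # type and first few values of each column
summary(wine)  # min, quartiles, mean and max per column
```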
The quality is an integer value with median 6 and mean 5.878. Now I will plot the histogram for quality to ascertain the type of distribution.
We can see that most quality values are between 5 and 7. The minimum and maximum values of quality are 3 and 9 respectively. I believe that quality should be an ordered factor since the values are discrete and run from low to high. I will do the conversion to an ordered factor now.
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
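A conversion like this can be sketched with `factor(..., ordered = TRUE)`. The short vector below is a toy stand-in for the real column, and keeping a separate numeric copy is an assumption on my part (it is convenient later for correlations).

```r
# Toy quality scores standing in for the real column (assumption).
quality_num <- c(6L, 6L, 5L, 7L, 3L, 9L, 4L, 8L)

# Ordered factor with levels running from the lowest score to the highest.
quality <- factor(quality_num, levels = 3:9, ordered = TRUE)

str(quality)
quality[1] < quality[4]  # ordered factors support comparisons: is 6 below 7?
```

With levels 3 to 9, a score of 6 is stored as level code 4, which matches the `4 4 4 ...` codes printed above.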
Now I will look at histograms of all the other variables starting with fixed acidity to determine their distributions.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adjusting the binwidth and removing outliers from the above graph.
Fixed acidity follows an approximately normal distribution with mean 6.855 and median 6.8.
From now on I will preadjust the binwidth and remove outliers from histograms.
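As a rough sketch of what "preadjusting" could look like in ggplot2: set `binwidth` explicitly and drop the extreme upper tail before plotting. The data here is synthetic, and the 99th-percentile cutoff and the bin width of 0.2 are assumptions, since the source does not say how outliers were removed.

```r
library(ggplot2)

# Synthetic stand-in for fixed.acidity with one extreme value (assumption).
set.seed(1)
wine <- data.frame(fixed.acidity = c(rnorm(500, mean = 6.9, sd = 0.8), 14.2))

# Drop everything above the 99th percentile (assumed outlier rule).
upper <- unname(quantile(wine$fixed.acidity, 0.99))
trimmed <- subset(wine, fixed.acidity <= upper)

# An explicit binwidth silences the `stat_bin()` "bins = 30" message.
p <- ggplot(trimmed, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.2, colour = "white") +
  labs(x = "Fixed acidity", y = "Count")
p
```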
Moving on to volatile acidity.
Even this distribution is approximately normal. The mean volatile acidity is 0.2782 while the median is 0.26.
Now I will plot the distribution for citric acid.
Citric acid too follows a normal distribution with mean 0.3342 and median 0.32.
Moving on to residual sugar.
This distribution is strongly right-skewed, which is consistent with the large gap between the mean (6.391) and the median (5.2). I will apply a log transformation now to try to bring it closer to a normal distribution.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
It can be observed that residual sugar has a bimodal distribution after log transformation.
Now I will plot a histogram of chlorides.
This distribution is also highly right-skewed. The mean and median values of chlorides are 0.04577 and 0.043 respectively. I will now apply a log transformation to the x axis to try to bring it closer to a normal distribution.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
After log scaling, chlorides follows an approximately normal distribution.
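A log-scaled histogram of this kind can be sketched as below, using synthetic log-normal data as a stand-in for chlorides (an assumption). The "Scale for 'x' is already present" message shown above is what ggplot2 prints when a second x scale is added to a plot that already has one; using a single `scale_x_log10()` call avoids it.

```r
library(ggplot2)

# Synthetic right-skewed stand-in for chlorides (assumption).
set.seed(2)
wine <- data.frame(chlorides = rlnorm(500, meanlog = log(0.043), sdlog = 0.3))

# One x scale only: breaks and the log transform are set in the same call.
p <- ggplot(wine, aes(x = chlorides)) +
  geom_histogram(bins = 40, colour = "white") +
  scale_x_log10(breaks = c(0.02, 0.04, 0.08, 0.16)) +
  labs(x = "Chlorides (log10 scale)", y = "Count")
p
```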
Moving on to free sulfur dioxide.
Free sulfur dioxide has a normal distribution with median 34.00 and mean 35.31.
Moving on to total sulfur dioxide.
Total sulfur dioxide also has a normal distribution. The median and mean values are 134 and 138.4 respectively.
Plotting the histogram for density now.
Density also follows an approximately normal distribution with mean 0.994 and median 0.9937.
Moving on to pH now.
pH has an almost perfectly normal distribution, which is supported by the fact that the mean (3.188) and median (3.180) are nearly equal.
Plotting the histogram for sulphates now.
Sulphates also follows an approximate normal distribution. The mean and median values are 0.4898 and 0.47 respectively.
Moving on to the last variable alcohol.
Alcohol does not follow a perfect normal distribution but we can approximate it as such. This can be validated by the fact that there is little difference between mean (10.51) and median (10.4).
I would also like to create a new variable for bound sulfur dioxide since we have variables for total and free sulfur dioxide. This new variable could be useful in future analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
Bound sulfur dioxide also follows a normal distribution with mean 103.1 and median 100.
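The derived variable is presumably total minus free sulfur dioxide. A minimal sketch, using the first four rows printed in the `str()` output earlier:

```r
# First four rows of the two source columns, copied from the str() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 is whatever part of the total is not free (assumed definition).
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine$bound.sulfur.dioxide
summary(wine$bound.sulfur.dioxide)
```

The resulting values (125, 118, 67, 139) agree with the `bound.sulfur.dioxide` column that appears in the later `str()` output, which supports the assumed definition.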
There are 4898 white wine observations with 12 variables each (13 columns including the row index X). One of the variables, quality, can be considered an ordered factor since it only has discrete integer values ranging from 3 to 9. All other variables are quantitative features with numeric ranges.
We want to determine a model for predicting quality so quality is of course the most important feature. Other than that I believe that alcohol level will play a significant part in determining the quality of wine.
According to research, residual sugar and sulphates play a big role in determining the quality of wine. I expect these features to support my investigation into the feature of interest, which is quality.
I created bound sulfur dioxide from two existing variables, free sulfur dioxide and total sulfur dioxide, since it could help me understand the dataset further and play a big part in future analysis. I also changed quality to an ordered factor.
The residual sugar and chlorides were the only unusual distributions that didn’t look normal. I transformed these variables to a logarithmic scale since both were highly right skewed. The residual sugar converted to a clear bimodal distribution while chlorides had a normal distribution after transformation.
First I will create a scatterplot matrix using ggpairs to explore all the data in one chart.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following variables have the strongest correlation with quality: alcohol, density and chlorides.
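One way to rank variables by their correlation with quality, without reading values off the scatterplot matrix, is sketched below. `rank_correlations` is a hypothetical helper, and the data is synthetic (built so that alcohol tracks quality), so the printed numbers are illustrative only and are not the correlations from the wine data.

```r
# Hypothetical helper: Pearson correlation of every numeric column with
# `target`, sorted by absolute strength.
rank_correlations <- function(df, target) {
  others <- setdiff(names(df)[sapply(df, is.numeric)], target)
  r <- sapply(others, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Synthetic stand-in data (assumption): alcohol depends on quality, pH does not.
set.seed(3)
toy <- data.frame(quality.num = sample(3:9, 200, replace = TRUE))
toy$alcohol <- 8 + 0.5 * toy$quality.num + rnorm(200)
toy$pH <- rnorm(200, mean = 3.2, sd = 0.15)

ranked <- rank_correlations(toy, "quality.num")
ranked
```

Note that `cor()` needs the numeric version of quality, not the ordered factor.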
Now we will further explore the relationship between these three variables and quality through box plots and scatter plots.
To make the plots clearer I will add jitter, remove outliers and add transparency. The resultant plots are below.
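A plot of this kind could be built roughly as follows. The data is synthetic, and the jitter width, alpha values and 1st/99th percentile axis limits are assumptions rather than the settings actually used.

```r
library(ggplot2)

# Synthetic stand-in: alcohol rises slightly with quality (assumption).
set.seed(4)
quality_num <- sample(5:7, 300, replace = TRUE)
toy <- data.frame(
  quality = factor(quality_num, ordered = TRUE),
  alcohol = 9 + 0.4 * quality_num + rnorm(300)
)

# Zoom the y axis to the 1st-99th percentile instead of deleting rows.
y_limits <- unname(quantile(toy$alcohol, c(0.01, 0.99)))

p <- ggplot(toy, aes(x = quality, y = alcohol)) +
  geom_jitter(width = 0.25, alpha = 0.2) +          # spread and fade the points
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +   # hide duplicate outlier dots
  coord_cartesian(ylim = y_limits) +
  labs(x = "Quality", y = "Alcohol")
p
```

Using `coord_cartesian()` only zooms the view, so the box statistics are still computed on all observations.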
The quality boxplot shows that when quality increases from 5 to 9, alcohol level also rises slightly with it. This explains the strong correlation between both features. A slightly upward trend in the dense part can also be observed from the scatter plot. Chlorides and density have a looser negative correlation with quality as compared to alcohol. Still a slight decreasing trend can be observed from the above plots.
Now we will move on to examining relationships between features other than quality that have a strong correlation, i.e., an absolute correlation higher than 0.5.
The correlation between residual sugar and density is 0.839. Sugar is more dense than other ingredients in the wine. Thus higher sugar levels will lead to higher density which is apparent from the above plot. Also alcohol is less dense as compared to water. So the correlation of -0.78 between alcohol & density and the decreasing trend in the scatter plot makes sense. Bound sulfur dioxide has a high positive correlation with density equal to 0.504. This could be because bound sulfur dioxide also has a negative correlation with alcohol equal to -0.449.
Other than this there is high correlation between total sulfur dioxide & free sulfur dioxide (0.616) and total sulfur dioxide & bound sulfur dioxide (0.922). This is expected since the variables are dependent on each other.
I evaluated various features against the feature of interest in the dataset. The feature of interest, quality, had a relatively strong correlation with alcohol, density and chlorides. Alcohol has a positive correlation while chlorides and density have a negative correlation with quality. None of these relationships is exactly linear, however, as can be observed from the box plots.
Density had interesting relationships with multiple variables. Density increased with increasing sugar and bound sulfur dioxide levels, while it decreased with increasing alcohol levels. This can be explained by the higher density of sugar and the lower density of alcohol compared to other ingredients.
Other than that bound sulfur dioxide has a negative correlation with alcohol and a positive correlation with density. I believe that during the fermentation process when more and more sugar is converted to alcohol, the levels of bound sulfur dioxide also decrease along with sugar levels.
The strongest relationship I found was between bound sulfur dioxide and total sulfur dioxide. This is expected, because bound sulfur dioxide is derived from total sulfur dioxide.
For the multivariate analysis I will convert alcohol into an ordered factor by dividing it into buckets. Then I will plot line graphs to determine the relation of quality with median density and median chlorides for different alcohol levels. This will give us further insights into our feature of interest.
## 'data.frame': 4898 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ quality.num : int 6 6 6 6 6 6 6 6 6 6 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
## $ alcohol.bucket : Ord.factor w/ 4 levels "(7,9.5]"<"(9.5,10.4]"<..: 1 1 2 2 2 2 2 1 1 3 ...
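The bucketing and the per-group medians can be sketched with `cut()` and `aggregate()`. The data is synthetic, and only the first two break points (7, 9.5, 10.4) are taken from the printed factor levels; the upper breaks 11.4 and 15 are assumptions.

```r
# Synthetic stand-in: density falls as alcohol rises (assumption).
set.seed(5)
toy <- data.frame(alcohol = runif(400, 8, 14.2))
toy$density <- 1.02 - 0.0025 * toy$alcohol + rnorm(400, sd = 0.001)
toy$quality <- factor(sample(3:9, 400, replace = TRUE),
                      levels = 3:9, ordered = TRUE)

# Breaks 7, 9.5, 10.4 match the printed levels; 11.4 and 15 are guesses.
toy$alcohol.bucket <- cut(toy$alcohol,
                          breaks = c(7, 9.5, 10.4, 11.4, 15),
                          ordered_result = TRUE)

# Median density for every (alcohol bucket, quality) combination.
medians <- aggregate(density ~ alcohol.bucket + quality, data = toy, FUN = median)
head(medians)

# Median density per bucket alone, to check the overall direction.
by_bucket <- tapply(toy$density, toy$alcohol.bucket, median)
by_bucket
```

The `medians` data frame is in the long format that a line plot with one line per alcohol bucket would need.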
The first chart clearly shows that the median density decreases as the alcohol level increases. For lower alcohol levels, density decreases with increasing quality and the trend is consistent across alcohol levels since the lines don’t overlap and follow similar slopes. The trend is more random in higher alcohol levels. The second chart is a bit more complicated. In general, the median level of chlorides is higher when the alcohol level is lower. However, this is not the case for the lowest quality level of 3. This might be due to noise in the data since there are only 20 observations of wine with quality 3.
Now I would like to explore one more relationship before concluding the analysis. I will plot a scatterplot between bound sulfur dioxide and alcohol for different quality levels. For this I have divided the quality into two buckets, (2,5] and (5,9].
## 'data.frame': 4898 obs. of 17 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ quality.num : int 6 6 6 6 6 6 6 6 6 6 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
## $ alcohol.bucket : Ord.factor w/ 4 levels "(7,9.5]"<"(9.5,10.4]"<..: 1 1 2 2 2 2 2 1 1 3 ...
## $ quality.bucket : Ord.factor w/ 2 levels "(2,5]"<"(5,9]": 2 2 2 2 2 2 2 2 2 2 ...
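The two quality buckets can be reproduced with `cut()` on the numeric quality column, and the within-bucket relationship checked with a per-group correlation. The data below is synthetic and deliberately constructed so that only the lower bucket has a negative trend; it illustrates the mechanics and says nothing about the real wine data.

```r
# Synthetic stand-in (assumption).
set.seed(6)
toy <- data.frame(
  quality.num = sample(3:9, 400, replace = TRUE),
  alcohol     = runif(400, 8, 14)
)

# Breaks 2, 5, 9 give the levels "(2,5]" and "(5,9]" shown above.
toy$quality.bucket <- cut(toy$quality.num, breaks = c(2, 5, 9),
                          ordered_result = TRUE)

# Built-in pattern: bound SO2 falls with alcohol only in the lower bucket.
low <- toy$quality.bucket == "(2,5]"
toy$bound.sulfur.dioxide <- ifelse(low, 250 - 12 * toy$alcohol, 100) +
  rnorm(400, sd = 15)

# Correlation between alcohol and bound SO2 within each quality bucket.
r_by_bucket <- sapply(split(toy, toy$quality.bucket),
                      function(d) cor(d$alcohol, d$bound.sulfur.dioxide))
r_by_bucket
```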
We can clearly see that there is a negative correlation between alcohol and bound sulfur dioxide for lower quality wines. Lower quality wines also generally have higher bound sulfur dioxide content and lower alcohol levels. On the other hand, there is no strong negative correlation between bound sulfur dioxide and alcohol for higher quality wines. The bound sulfur dioxide content is in the same range for all alcohol levels for higher quality wines.
I was able to explore the relationship of the feature of interest with other variables in detail. Visualizing the relationships between density, chlorides, alcohol and quality concisely allowed me to evaluate them at a deeper level. I determined that the relationship between density and alcohol stays consistent for all quality values. However, the relationship between chlorides and alcohol may change based on the quality.
The interaction between bound sulfur dioxide and alcohol was the most interesting. For lower quality wines there was a negative correlation between bound sulfur dioxide and alcohol levels. But there was no similar trend in higher quality wines. In fact, for higher quality wines the bound sulfur dioxide content was more or less in the same range.
I did not create any model with the dataset. The only relationships in the dataset that could be modeled are those between density and alcohol and between density and residual sugar, since only these pairs are strongly enough correlated. These relationships are of no interest since they can be explained by simple science and don’t contain the feature of interest, quality.
This is the most informative plot in this analysis, clearly showing the relationship between alcohol content and wine quality. The five boxplots show the alcohol content dropping over wines of quality 3, 4 and 5 before rising steeply again in wines of quality 6, 7 and 8. I have further improved the plot by adding color and proper labels to it.
This plot shows another important relationship of our feature of interest, quality. This chart shows that the median density decreases as the alcohol level increases. Also for lower alcohol levels, density decreases with increasing quality and the trend is consistent across alcohol levels. The trend is more random in higher alcohol levels.
This chart is the most interesting and surprising to me. It shows the relationship between bound sulfur dioxide and alcohol at different quality levels. Overall there is a fairly strong negative correlation between bound sulfur dioxide and alcohol. We can see from the chart that for lower quality levels there is a strong negative correlation between bound sulfur dioxide and alcohol. But what is most surprising is that for higher quality levels the negative correlation is much weaker.
This was a great learning exercise for me. In simple words, I learnt how to explore a huge dataset and draw conclusions about relationships between different variables in the dataset.
My major focus in this study was to explore the relationship of quality with other variables in the dataset. Quality has its strongest correlations with density, chlorides and alcohol levels. I was able to successfully explore how quality changes with these variables and draw conclusions about their behaviour.
The univariate and bivariate sections of the analysis were straightforward. But I faced challenges in the multivariate section. When an analyst is evaluating multiple variables at once, there are countless possibilities for structuring the visualization and there is a multitude of variable combinations to investigate. I was able to overcome this difficulty by focusing mainly on the feature of interest and building upon the analysis I did in the bivariate section. Creating a predictive model for quality was also a huge challenge since quality did not have strong enough correlations with any of the other variables.
The most obvious next step in the analysis would be to compare this data with the red wine data and find similar and conflicting trends in determining quality. This will help us in drawing further conclusions. Also, a predictive model for quality could be built using machine learning. The more complicated techniques in machine learning will come in handy while dealing with a large number of variables that are only loosely correlated with quality.
Looking back, this was a wonderful exercise to practice my exploratory data analysis abilities while discovering new insights about the world of wines at the same time.